bench(compliance): scanner pipeline benchmarks with baselines#204
Merged
Conversation
Large repos (1h+ SCIP index) hit the 10h timeout because the load pipeline was fully serial with several quadratic/cubic bottlenecks. loader.go: - Replace serial document loop with parallel goroutine pool (GOMAXPROCS workers): convert + RefIndex + ContainerIndex built per-doc in parallel, merged serially after WaitGroup - Fix ContainerIndex O(occurrences×defScopes) → early-exit by sorting defScopes by size ASC so first containing match is always innermost - Parallelize ConvertedSymbols pre-computation across batched goroutines - Add DocumentsByPath map[string]*Document for O(1) GetDocument lookup (was O(n) linear scan through Documents slice) - Remove raw *scippb.Index field — set but never read, retained the full parsed protobuf in memory indefinitely fts.go: - convertSymbolToFTSRecord: replace O(N×M×occ) nested document scan with RefIndex lookup — O(avg_refs_per_symbol) instead of scanning all docs engine.go: - Run PopulateFTSFromSCIP in background goroutine; FTS is an optional optimization and searches already fall back to in-memory when FTS is unavailable, so blocking engine init on it had no benefit Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds performance benchmarks for the compliance scanner hot paths — normalizeIdentifier, extractIdentifiers, extractContainer, matchPII — which run on every identifier in every line across every file during an audit but had zero benchmark coverage. Key findings from the baseline run (Apple M4 Pro): - Pipeline throughput is flat ~8.3 MB/s regardless of file size (good) - AuditFileSet/5kfiles: 5.6s wall, 21M allocs, 631MB — scales linearly with file count, never amortizes; root cause is extractIdentifiers allocating a fresh map[string]bool per line - MatchPII_PatternScale confirms O(n) suffix scan: 1.17µs at 80 patterns, 5.27µs at 500 — custom patterns degrade all misses proportionally Committed baseline at testdata/benchmarks/compliance_baseline.txt. Compare after changes with: go test -bench=. -benchmem -count=6 ./internal/compliance/... > /tmp/after.txt benchstat testdata/benchmarks/compliance_baseline.txt /tmp/after.txt Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
🟢 Change Impact Analysis
Blast Radius: 0 modules, 0 files, 0 unique callers 📝 Changed Symbols (3)
Recommendations
Generated by CKB |
CKB Analysis
Risk factors: Moderate churn: 763 lines changed 👥 Suggested: @lisa.welsch1985@gmail.com (100%), @talantyyr@gmail.com (60%)
🎯 Change Impact Analysis · 🟢 LOW · 3 changed → 0 affected
Symbols changed in this PR:
Recommendations:
💣 Blast radius · 0 symbols · 1 tests · 0 consumersTests that may break:
📊 Complexity · 3 violations
💡 Quick wins · 10 suggestions
📚 Stale docs · 160 broken references
Generated by CKB · Run details |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## develop #204 +/- ##
=======================================
Coverage 43.0% 43.0%
=======================================
Files 507 507
Lines 77953 78022 +69
=======================================
+ Hits 33568 33614 +46
- Misses 42022 42045 +23
Partials 2363 2363
Flags with carried forward coverage won't be shown. Click here to find out more. 📢 Thoughts on this report? Let us know! 🚀 New features to boost your workflow:
|
|
CKB review failed to generate output. |
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
CKB review failed to generate output. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Adds performance benchmarks for the compliance scanner — the innermost loop of every
ckb audit compliancerun — which had zero benchmark coverage. Also commits a reproducible baseline and documents every efficiency issue found during the investigation.Files changed:
internal/compliance/scanner_bench_test.go— 10 benchmark functionstestdata/benchmarks/compliance_baseline.txt— committed 5-run baseline (Apple M4 Pro, arm64)Architecture context
During
ckb audit compliance,RunAuditfans out up to 131 checks in parallel (capped at 32 workers). The per-line call chain during a PII scan is:On a large repo this chain runs tens of millions of times across the check fan-out.
Benchmark results (Apple M4 Pro, arm64, -count=5)
Individual hot paths
normalizeIdentifier(typical)normalizeIdentifier(60-char)extractIdentifiersextractContainerisNonPIIIdentifiermatchPII(hit)matchPII(full suffix scan / miss)NewPIIScannerOn the zeros:
matchPIIandisNonPIIIdentifierallocate nothing in their inner paths. Zero is correct — map lookups and string comparisons stay on the stack.Pipeline scale (single file)
Throughput is flat ~8.3 MB/s regardless of file size — the work is strictly O(lines). But 14 allocs/line is constant. Every line allocates regardless of content.
File-set scale (full audit simulation)
Perfectly linear — no amortization. On a 20k-file monorepo with all frameworks enabled, multiple PII-checking frameworks (GDPR, HIPAA, CCPA, ISO27701) each invoke the PII scanner independently, so the real allocation count is a multiple of these figures.
Pattern count scaling
Exactly O(n). Every miss touches every pattern. Custom patterns from
.ckb/config.jsondegrade all non-matching identifiers proportionally.Efficiency issues found — ranked by impact
1.
regexp.MustCompileinside the per-line inner loop — CRITICALThree safety-framework checks compile a new regex on every line of every file:
All three have the identical pattern:
regexp.MustCompileparses and compiles the regex each call. For a 10k-line file with 500 functions, this compiles 10k–500k regexes per check invocation. The pattern only changes whencurrentFuncchanges (at a new function boundary), so it should be compiled once per detected function, not once per line.Fix: move the compile outside the line loop, keyed on
currentFunc:Or even simpler: use
strings.Contains(line, currentFunc+"(")— the regex only needs to match a word boundary and\s*(, whichstrings.Containscan approximate with a quick check before a regex fallback.2. 18 independent
NewPIIScanner+ScanFilescalls per audit — HIGHEvery PII-related check creates its own scanner and walks every source file:
Each
ScanFilescall opens every.go/.ts/.py/...file in the repo withos.Open, reads it line by line, and runs the full normalize+match pipeline. A 5k-file repo scanned 18 times = 90k redundant file opens with 90k redundant pipeline passes.Since all 18 calls use the same
scope.Config.PIIFieldPatterns, the result is identical every time. TheScanScopeis shared across all checks but has no result cache.Fix: compute PII fields once, store on
ScanScope:Expected reduction: 17 of 18 full file scans eliminated. On a 5k-file audit this is the difference between 5.6s × 18 and 5.6s × 1 for PII scanning alone.
3.
gdpr/retention.goreads files twice per check — HIGHnoRetentionPolicyCheck.Run(retention.go:22) callspiiScanner.ScanFilesto find PII fields (opens all files once), then immediately callsos.ReadFileon every file again to scan for retention indicators (line 49):strings.ToLower(string(content))on large files allocates a full copy of the file contents. Three more checks in retention.go repeat the same double-open pattern (lines 156, 215, 384).Fix: check for retention indicators during the initial PII scan pass (they can share the same line-by-line iteration), or cache file contents on
ScanScopefor the first read.4.
nonPIISuffixesslice allocated on everyisNonPIIIdentifiercall — MEDIUMA 46-element string slice is allocated on the heap every time this function is called.
isNonPIIIdentifieris called frommatchPIIfor every identifier that hits the name-matching path — potentially millions of times per audit.Additionally, iterating 46 suffixes linearly when most calls return
falseafter the full scan is O(46) per call. This is measurable in the benchmark at 197 ns/op, but the allocation is the worse issue.Fix: move to a package-level
map[string]boolbuilt at init time:Then the check becomes O(1) with no allocation:
return nonPIISuffixSet[normalized](with a suffix-split for theHasSuffixcase).5.
extractIdentifiersallocates a freshmap[string]boolper line — MEDIUMAt 14 allocs/line and ~6,000 allocs for 500 lines, this map is the dominant allocation source. For a 5k-file audit it produces ~18M allocations just from this map.
Fix: pass a
map[string]boolin from the caller (created once per file, cleared between lines):Go's map-clear idiom (
for k := range m { delete(m, k) }) reuses the underlying hash table memory without resizing. This would cut 18M allocs to ~5k (one per file) for the same audit.6.
matchPIIsuffix scan is O(n_patterns) — MEDIUMEvery identifier that doesn't exactly match (the common case) triggers a linear scan across all patterns. Already measured: 1.13µs for the default ~80 patterns, scaling to 5.27µs at 500. With 18 independent scanner instances, custom patterns degrade all 18 passes.
Fix: at
NewPIIScannerconstruction time, build a suffix → pattern index:When checking
user_email_address, extract the last word (address) and look it up directly. O(1) instead of O(n_patterns).7.
normalizeIdentifierrune allocations — LOWFour heap allocations per call:
[]runeconversion,result []runeappend,string(result)conversion, and thestrings.ReplaceAllloop for__collapsing. The loop is bounded (SCREAMING_SNAKE identifiers can have consecutive underscores) but allocates a new string on each iteration.The 138 ns/op baseline is acceptable for typical identifiers. The 650 ns/op for long (60-char) identifiers is where the rune slice capacity grows. Not a priority fix unless
normalizeIdentifieris called on very long synthetic identifiers, but worth noting for the suffix-index approach above (which would callnormalizeIdentifierless often).8.
bufio.Scannerdefault 64 KB line limit — LOWAll file scanning (scanner.go, ScanFileLines) uses
bufio.NewScanner(f)with the default 64 KB max token size. This causes silent truncation on lines longer than 64 KB — possible in minified JS, generated protobuf, or large raw string literals. Not a performance issue but a correctness risk on generated code.Summary table
regexp.MustCompileper linesync.Oncecache toScanScopeToLowercopynonPIISuffixesallocated per callisNonPIIIdentifiercallextractIdentifiersmap per linematchPIIO(n_patterns) suffix scannormalizeIdentifierrune allocsbufio.Scanner64 KB limitscanner.Buffer()callHow to compare after optimizations
The
BenchmarkAuditFileSetbenchmarks are the best signal for the high-impact fixes — they will show the improvement from fix #2 (PII scan deduplication) most clearly, since they simulate the multi-file pass that dominates real audit time.Test plan
go test -run='^$' -bench='^$' ./internal/compliance/...— compile check passesbenchstat